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ABSTRACT 

Summary: Hypersensitivity to DNasel digestion is a hallmark of open 
chromatin, and DNasel-seq allows the genome-wide identification of 
regions of open chromatin. Interpreting these data is challenging, lar- 
gely because of inherent variation in signal-to-noise ratio between 
datasets. We have developed PeaKDEck, a peak calling program 
that distinguishes signal from noise by randomly sampling read den- 
sities and using kernel density estimation to generate a dataset- 
specific probability distribution of random background signal. 
PeaKDEck uses this probability distribution to select an appropriate 
read density threshold for peak calling in each dataset. We benchmark 
PeaKDEck using published ENCODE DNasel-seq data and other peak 
calling programs, and demonstrate superior performance in low 
signal-to-noise ratio datasets. 

Availability and implementation: PeaKDEck is written in standard 
Perl and runs on any platform with Perl installed. PeaKDEck is also 
available as a standalone application written in Perl/Tk, which does not 
require Perl to be installed. Files, including a user guide, can be down- 
loaded at: www.ccmp.ox.ac.uk/peakdeck. 
Contact: chris.ocallaghan@ndm.ox.ac.uk 

Supplementary information: Supplementary data are available at 
Bioinformatics online. 
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1 INTRODUCTION 

DNasel hypersensitivity analysis can be used to map sites of 
open chromatin in genomic DNA (Wu, 1980). Hypersensitivity 
of DNA to digestion by DNasel arises when nucleosomal his- 
tone proteins are displaced from chromatin, leaving a region of 
'naked' nucleosome-free DNA that is accessible to the DNasel 
enzyme (Owen-Hughes and Workman, 1996). Histone displace- 
ment and consequent DNasel hypersensitivity characteristically 
occur at promoter and enhancer sites (Song et al., 2011), allow- 
ing the sequence-specific binding of proteins, such as transcrip- 
tion factors to the DNA. Recently, advances in high-throughput 
sequencing methods have been applied to DNasel hypersensitiv- 
ity testing [DNasel-seq; Supplementary Information, Section SI; 
(Boyle et al, 2008a; Hesselberth et al, 2009)]). With DNasel-seq, 
regulatory DNA fragments at accessible open chromatin sites are 
released by 'two-hit' digestion. These fragments are sequenced 
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using high-throughput technology and mapped back to the ref- 
erence genome. 

The comparison of DNasel hypersensitivity patterns in differ- 
ent datasets can play an important role in the study of gene 
regulation (Sheffield et al., 2013), for example, in response to a 
physiological stimulus. However, a major challenge in analyzing 
these data is the variation in signal-to-noise ratio (SNR) between 
datasets (Supplementary Table SI). While many potential 
sources of noise exist, a key variable affecting the SNR is the 
enzymatic activity of DNasel, which is difficult to control be- 
tween experiments. Variation in the level of DNasel activity 
leads to different amounts of digestion (Supplementary 
Fig. S3), altering the composition of the population of short 
DNA fragments that are sequenced. There is no universal surro- 
gate measurement of digestion that can be used to accurately 
quantify digestion at the library preparation stage and no ideal 
control sample (discussed in Supplementary Information, 
Section S2). For these reasons, distinguishing signal from noise 
in a manner that allows comparison between datasets is more 
challenging with DNasel-seq than with other high-throughput 
sequencing approaches, such as ChlP-seq. 

Several peak-calling programs have been developed for use 
with high-throughput sequencing data. Most focus on ChlP- 
seq where a clear input control is available, but some have also 
been used for DNasel-seq data, including F-seq (Boyle et al., 
2008b), MACS (Zhang et al., 2008) and HOMER (Heinz 
et al., 2010). While analyzing our own DNasel-seq data, we 
found variable performance between peak callers, particularly 
at low SNRs, and the identification of a suitable peak threshold 
was challenging. This confounded the comparison between data- 
sets with different SNRs (see Supplementary Information, 
Section S4 for SNR estimation). 

We have developed a peak-calling algorithm (PeaKDEck) that 
limits the effect of SNR on threshold setting, which is of particu- 
lar value in datasets with lower SNRs. We have used Hotspot- 
identified (John et al., 201 1) DNasel-seq sites from 125 cell types 
published by ENCODE (Thurman et al., 2012) to compare the 
quality of peak calling by PeaKDEck with that by other peak 
calling programs. PeaKDEck also includes additional tools for 
DNasel-seq data analysis (see Supplementary Information, 
Section S9 for description of additional tools). 

2 PEAK CALLING 

The method of peak calling used by PeaKDEck is illustrated in 
Supplementary Figure S12. First, 50000 sites are selected 
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Fig. 1. (A) PeaKDEck uses a sampling bin to measure signal at any given 
locus. PeaKDEck calculates the corrected read density by first counting 
the number of read start sites (green) within a central bin (e.g. Five read 
start sites in a bin of size 300 bp). Next, the read density in a larger 
background bin is measured (e.g. 10 reads in a bin of size 3000 bp). 
Based on this background read density, the expected read density in a 
bin of central bin size is calculated (e.g. 10 reads in 3000 bp, giving an 
expected read density of 1 read in 300 bp) and subtracted from the central 
bin read density to give the corrected read density (4 in this example). (B) 
We calculated the percentage of unique sites identified by four different 
peak callers in each of 10 sample datasets, and color-coded each dataset 
based on the SNR from blue (low SNR) to red (high SNR). For datasets 
with low SNR, PeaKDEck had the lowest percentage of unique peaks out 
of the four peak callers 

randomly from the genome, and overlapping sites are discarded 
(Supplementary Fig. S12A). Next, the signal strength is mea- 
sured at the non-overlapping sites using sampling bins (Fig. 1A 
and Supplementary Fig. S12B). This is achieved by measuring 
the background read density in a large bin surrounding the site 
('background read density'; default — 3000 bp), and then measur- 
ing the read density in a smaller focused bin at the same site 
('central read density'; default — 300 bp). The corrected read 
density is calculated by subtracting the expected read density 
(given the background read density) from the central read 
density. 

Once this calculation has been repeated for all the randomly 
selected sites in the dataset, a probability distribution is gener- 
ated to describe the distribution of these corrected read density 
scores (Supplementary Fig. S12C). Because the distribution of 
these scores is typically non-Gaussian, PeaKDEck uses kernel 
density estimation (Supplementary Information, Section S5) to 
calculate a probability distribution for the randomly selected 
corrected read density scores, where n is the number of sites 
sampled, h is the bandwidth (h= 1), x,- is the value of x for the 
rth site and K„ is a Gaussian kernel: 
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To identify a threshold for peak calling, PeaKDEck calculates 
the probability that a given corrected read density belongs to the 
background probability distribution for increasing values of 
corrected read density. The corrected read density threshold 
is when the calculated probability drops below a pre-determined 



level (default — P< 0.001). The entire dataset is then scanned in 
overlapping sampling bins and the corrected read density deter- 
mined across the genome (Supplementary Fig. S12D). Peaks are 
called where the corrected read density exceeds the threshold. 
Peaks can be scored by their maximum corrected read density 
or probability score. 



3 PERFORMANCE 

We assessed the performance of PeaKDEck by calling peaks in 10 
DNasel-seq datasets from the NCBI Short Reads Archive 
(Supplementary Information, Section S3). To determine whether 
the sites identified by PeaKDEck as open chromatin were known 
open chromatin sites, we amalgamated 125 ENCODE DNasel- 
seq datasets for different cell types, tagging each genomic locus 
with the number of cell types with open chromatin at that site 
(Supplementary Fig. S5). For each of the 125 datasets, we calcu- 
lated the percentage of open chromatin sites unique to that data- 
set, the percentage of sites shared with one other cell type, 
continuing up to the percentage of sites per dataset shared 
across all 125 cell types. The mean percentage of unique peaks 
per dataset was 3.61 ±4.13% (± standard deviation). For the 
peaks called by PeaKDEck in our 10 sample DNasel-seq datasets, 
the mean percentage of unique peaks per dataset was 4.6 ± 1.6% 
(± standard deviation; see Supplementary Information, Section S6 
and S7 for details). This demonstrates that the overlap between 
open chromatin sites identified by PeaKDEck and known open 
chromatin sites is within the normal range of variation observed in 
the ENCODE data. 

Because PeaKDEck adjusts signal measurement to account for 
local variation in read densities and extensively samples back- 
ground signal in individual datasets, PeaKDEck performs well at 
setting thresholds in low SNR datasets. To demonstrate this, we 
called peaks in the 10 sample NCBI DNasel-seq datasets with 
PeaKDEck, MACS, FSEQ and HOMER (Supplementary 
Fig. SI 3) and quantified the number of unique sites (not occur- 
ring in the ENCODE-125 dataset) as a percentage of the total 
identified sites in each dataset, with each peak caller (Fig. IB). In 
the dataset with lowest SNR, 6.95% of the total peaks identified 
by PeaKDEck were unique, compared with 7.38, 9.64 and 31.4% 
of peaks identified by MACS, Homer and FSEQ, respectively, 
suggesting that PeaKDEck is more likely to identify authentic 
open chromatin sites even at low SNRs compared with other 
available peak callers. 

Although PeaKDEck is designed for use in DNasel-seq data 
analysis (using the read start site as the point of interest), it can 
be used for similar methods such as chromatin immunoprecipi- 
tation sequencing and FAIRE-seq, by applying a user-defined 
offset to calculated genomic positions. PeaKDEck is especially 
useful compared with other peak callers where SNR is low (see 
Supplementary Information, Section S8). 
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